ABSTRACT OF THE THESIS


Aren’t we all nearest neighbors? 


by 


Mark Michel Saroufim 
Master of Science in Computer Science 
University of California, San Diego, 2014 
Professor Sanjoy Dasgupta, Chair 


We start with a review of the pervasiveness of the nearest neighbor search
problem and techniques used to solve it along with some experimental results.
In the second chapter, we show reductions between two different classes of
geometric proximity problems: the nearest neighbor problems to solve the Euclidean
minimum spanning tree problem and the farthest neighbor problems to solve the
k-centers problem. In the third chapter, we unify spatial partitioning trees
under one framework, the meta-tree. Finally, we propose a dual tree algorithm for
Bichromatic Closest Pair and measure the complexity of batch nearest neighbor
search.
Chapter 1


Nearest Neighbor 

1.1 Introduction 

This chapter will focus almost exclusively on the nearest neighbor problem. 
The problem is defined by two objects: the first is an ordered pair $(X, d)$ where
$X = \{x_1, \dots, x_n\} \subset \mathbb{R}^D$ is a set of points and $d$ is a distance function
$d : X \times X \to \mathbb{R}$. The second is a query point $q \in \mathbb{R}^D$. The nearest neighbor
search problem is then formulated as

$$\arg\min_{x \in X} d(q, x)$$

Notice that for most problems we restrict $d$ to have some sort of structure. We call $d$ a
metric if for any $p, q, u \in X$, $d$ satisfies the following conditions.

• $d(p, q) \ge 0$ with equality if and only if $p = q$

• $d(p, q) = d(q, p)$

• $d(p, q) + d(q, u) \ge d(p, u)$

If $d$ is a metric we then call the ordered pair $(X, d)$ a metric space.
An example of a metric is the distance induced by the Minkowski norm (for $p \ge 1$), where

$$d(q, u) = \left( \sum_{i=1}^{D} |q_i - u_i|^p \right)^{1/p}$$

By setting $p = 2$ we recover the standard $\ell_2$ norm otherwise known as the
Euclidean norm

$$d(q, u) = \sqrt{\sum_{i=1}^{D} |q_i - u_i|^2}$$

Often we will also be interested in using a wider class of distance functions
called Bregman divergences (see appendix). Of particular interest is the
KL (Kullback-Leibler) divergence, which is a natural distance measure between
distributions.

$$d_{KL}(p, q) = \sum_{i=1}^{D} p_i \ln \frac{p_i}{q_i}$$
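The two distance functions above can be written down directly. The following is a minimal sketch; the function names `minkowski` and `kl_divergence` are illustrative and not part of the thesis. It also makes visible that the KL divergence is not symmetric, so it is a divergence rather than a metric in the sense defined earlier.

```python
import math

def minkowski(q, u, p=2.0):
    """(sum_i |q_i - u_i|^p)^(1/p); p = 2 is the Euclidean distance, p = 1 the Manhattan distance."""
    return sum(abs(a - b) ** p for a, b in zip(q, u)) ** (1.0 / p)

def kl_divergence(p, q):
    """sum_i p_i ln(p_i / q_i) for two discrete distributions given as sequences.
    Assumes q_i > 0 wherever p_i > 0; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(minkowski((0, 0), (3, 4)))                # Euclidean distance
print(minkowski((0, 0), (3, 4), p=1))           # Manhattan distance
print(kl_divergence((0.5, 0.5), (0.9, 0.1)))    # differs from the value below:
print(kl_divergence((0.9, 0.1), (0.5, 0.5)))    # the divergence is not symmetric
```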

It is also worth noting that since good algorithms with guarantees have
evaded researchers for nearest neighbor search, a problem that has become
increasingly interesting is that of approximate nearest neighbor search or $c$-nearest
neighbor. We now ask for a point $x'$ that is not too far ($c$-approximate) from the
optimal nearest neighbor of $q$.

$$d(q, x') \le c \cdot \min_{x \in X} d(q, x)$$

In what follows we will make it clear from context which of the two problems 
we are referring to. 

1.2 Applications 

1.2.1 Traveling Salesman 

One of the first successes of applying nearest neighbor search was in finding
an efficient algorithm for the Euclidean traveling salesman problem. As a brief
reminder, TSP is one of the quintessential NP-hard problems: given an undirected
weighted graph $G = (V, E)$ we’d like to find the shortest tour that traverses
all the vertices $v_i \in V$ without ever revisiting the same vertex twice except for the
first one. Brute force trivially solves this problem in $O(n! \cdot n)$: $n!$ to enumerate all
possible tours and $n$ to check whether they are indeed tours. The greedy nearest
neighbor algorithm (algorithm 1) for TSP quickly finds a tour but offers no
constant-factor guarantee on how far it is from the optimal one.

Algorithm 1 Greedy Nearest Neighbor TSP
1: Tour = {}
2: Pick arbitrary vertex $v \in V$
3: while $V \ne \emptyset$ do
4:     Tour.append($v$)
5:     $V = V - \{v\}$
6:     $v' = \arg\min_{v_i \in V} d(v, v_i)$
7:     $v = v'$
8: end while
9: return Tour
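The greedy procedure of algorithm 1 can be sketched in a few lines. This is an illustrative version under the assumption that cities are points in the plane and that the nearest unvisited city is found by a linear scan; the names `greedy_nn_tour` and `tour_length` are hypothetical.

```python
import math

def greedy_nn_tour(points, start=0):
    """Greedy nearest neighbor tour: repeatedly walk to the closest unvisited point."""
    unvisited = set(range(len(points)))
    unvisited.remove(start)
    tour = [start]
    while unvisited:
        last = points[tour[-1]]
        # Linear scan; a nearest neighbor data structure could replace this step.
        nxt = min(unvisited, key=lambda i: math.dist(last, points[i]))
        unvisited.remove(nxt)
        tour.append(nxt)
    return tour

def tour_length(points, tour):
    """Length of the closed tour, including the edge back to the start."""
    return sum(math.dist(points[tour[i]], points[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

cities = [(0, 0), (5, 0), (1, 0), (6, 1), (2, 1)]
tour = greedy_nn_tour(cities)
print(tour, tour_length(cities, tour))
```

Each of the n steps scans the remaining cities, so this sketch runs in O(n^2) time overall.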


1.2.2 k-nn Classification 

Suppose we are given a training set
$X = \{x_1, \dots, x_n\} \subset \mathbb{R}^D$ where every point $x_i$ is associated with some label
$f(x_i) = y_i \in \mathbb{N}$. Now you are given a new point $q$ that is not yet associated with a label
$f(q)$. A natural way to classify $q$ is then to find its nearest neighbor:

$$x^* = \arg\min_{x \in X} d(q, x)$$

and set $f(q) = f(x^*)$. The process we’ve just described is nearest neighbor
classification [Alt92]. It’s simple and takes $O(n)$ if we choose to trivially search for a
nearest neighbor. Nearest neighbor classification can be very sensitive to outliers
but it is easy to make it more robust if we instead retrieve the $k$ nearest neighbors
of $q$ and use a simple voting scheme (majority) to decide the label of $q$. More generally we can
associate a prior weight $w_i$ with the $i$’th label and multiply that by the number $n_i$ of
nearest neighbors of $q$ that were labeled $i$.

$$f(q) = \arg\max_i \{ w_i n_i \}$$
There also exist schemes where the voting power of a point is inversely
proportional to its distance from the query point $q$. As an example, in figure
1.1 we project the iris dataset onto a two-dimensional plane, set the number of
nearest neighbors to $k = 18$ and use a k-d tree to find them; the voting power of
every point is inversely proportional to its distance from the query point $q$.

[Figure: 3-Class classification (k = 18, weights = 'distance', algorithm = 'kd_tree')]

Figure 1.1: k-nn classification on iris dataset
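The voting schemes described in this section can be illustrated with a small brute-force classifier. This is a sketch and not the code behind figure 1.1: the function name `knn_classify`, the toy data, and the choice of weight 1/distance are assumptions made for illustration.

```python
import math
from collections import defaultdict

def knn_classify(train, labels, q, k=3, weighted=True):
    """Label q by a vote among its k nearest training points.
    With weighted=True each neighbor votes with weight 1/distance."""
    neighbors = sorted(range(len(train)), key=lambda i: math.dist(train[i], q))[:k]
    votes = defaultdict(float)
    for i in neighbors:
        dist = math.dist(train[i], q)
        if weighted and dist == 0.0:
            return labels[i]              # q coincides with a training point
        votes[labels[i]] += 1.0 / dist if weighted else 1.0
    return max(votes, key=votes.get)

train = [(0.0, 0.0), (0.0, 1.0), (3.0, 3.0), (3.0, 4.0), (4.0, 3.0)]
labels = [0, 0, 1, 1, 1]
q = (1.4, 1.4)
print(knn_classify(train, labels, q, k=5, weighted=False))  # plain majority vote
print(knn_classify(train, labels, q, k=5, weighted=True))   # closer points count more
```

On this toy data the two schemes disagree: the majority of the five neighbors carries one label, while the two closest points carry the other and outweigh them once votes are scaled by inverse distance.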


1.2.3 N-body problems 

N-body problems have fascinating origins in Newtonian mechanics. Specifically,
suppose we are trying to understand the interaction between N spatial bodies.
Newton’s law of universal gravitation tells us that two bodies with masses $m_1$ and
$m_2$ at a distance $r$ from each other attract each other with a force

$$F = G \frac{m_1 m_2}{r^2}$$

where $G$ is the gravitational constant. This seems easy enough but now suppose we
have more than two bodies and have to deal with N such bodies. Since the
force $F$ between two bodies is stronger the closer they are, we can choose to look at
the $k$ closest bodies and compute $F$ between a body $q$ and its $k$ nearest neighbors.
Why not just compute $F$ to all bodies? Astrophysicists estimate the number of
stars in the Milky Way alone to be 100 billion; this puts us in a range where the
asymptotic complexity of an algorithm starts to matter.

1.2.4 Single Linkage Clustering 

Clustering is a fairly ubiquitous problem in machine learning. It can be
thought of as a dimensionality reduction problem where we try to reduce the size
of a dataset from $n$ to $k$ where $k \ll n$. The remaining $k$ points are called
cluster centers. One approach to clustering uses nearest neighbor search as its main
subroutine: UPGMA [LL98] clustering is an agglomerative hierarchical clustering
technique where we start with $n$ cluster centers (one for each data point). We then
find $\arg\min_{y} \min_{x \in X} d(x, y)$, merge $y$ into the cluster that $x$ is assigned to,
and repeat this process until all points belong to the same cluster. We can also
choose to terminate before all the points belong to the same cluster to get $k$ cluster
centers instead of 1.
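The merge-the-closest-pair loop described above can be sketched by brute force. The function name `single_linkage` is hypothetical, and the quadratic search for the closest pair of clusters stands in for the nearest neighbor subroutine the text refers to.

```python
import math

def single_linkage(points, k):
    """Agglomerative single linkage: repeatedly merge the two clusters whose closest
    pair of points is nearest, until only k clusters remain (brute force)."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters.pop(b))   # b > a, so index a stays valid
    return [sorted(c) for c in clusters]

pts = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(single_linkage(pts, 3))
print(single_linkage(pts, 2))
```

Stopping at k clusters rather than 1 corresponds to the early termination mentioned in the text.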


1.3 Algorithms 

Let’s restate the nearest neighbor problem 

$$\arg\min_{x \in X} d(q, x)$$

The trivial solution will compare $d(q, x)$ for all $x \in X$ and pick the smallest one.
This approach takes $O(n)$ time and is fine given that we are only doing a nearest
neighbor query once. However, assuming that we have not just one but $m$ queries
$Q = \{q_1, \dots, q_m\}$, the naive approach will now take $O(mn)$, which is $O(n^2)$ when
$m = n$ and is extremely large for our purposes. Instead we’d like to take a similar
approach to the one seen in sorting algorithms, where we incur some sort of cost $P(n)$
to build a nearest neighbor data structure and then answer nearest neighbor queries
in $T(n) = o(n)$. This results in a total running time of $O(P(n) + |Q| T(n))$ where
$|Q|$ is the size of the query set.
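For reference, the trivial baseline that the data structures below try to beat looks as follows. This is a sketch with illustrative names; it has no preprocessing cost and spends O(n) per query.

```python
import math

def nearest_neighbor(X, q):
    """Exact nearest neighbor of q by linear scan: n distance evaluations."""
    return min(X, key=lambda x: math.dist(x, q))

def batch_nearest_neighbors(X, Q):
    """Answer every query independently: |Q| * n distance evaluations in total."""
    return [nearest_neighbor(X, q) for q in Q]

X = [(0, 0), (2, 2), (5, 1), (9, 9)]
Q = [(1.8, 2.1), (6, 0), (8, 8)]
print(batch_nearest_neighbors(X, Q))
```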

1.3.1 Tree Based Techniques 

In what follows we will introduce several spatial partitioning schemes for
the purpose of fast nearest neighbor retrieval. All the examples below are binary
spatial partitioning trees where the two subtrees of any given node are determined
using some sort of splitting rule. We hierarchically divide up the dataset $X$ into a
binary spatial tree $T_X$ using a BuildTree($X$) routine, after which we answer nearest
neighbor queries for points $q$ using NNS($q$, $T_X$). A query can be answered in one of
two ways:

1. Comprehensive Search (shown in algorithm 2) where we are conservative about pruning
out subtrees and potentially could visit all $O(n)$ points.

2. Defeatist Search where at every iteration we prune out an entire subtree of
the binary partition tree so that, in a balanced tree, we visit at most $O(\log n)$ nodes.

Even though Comprehensive Search always finds the true nearest neighbor
(Defeatist Search has no such guarantee), its worst-case time complexity is the same as a
trivial linear scan. For that reason Defeatist Search is instead used in practice and
unless explicitly mentioned all nearest neighbor search schemes in this text will
be of the Defeatist nature. The difference as far as implementation is concerned
is minimal: below is the Comprehensive Nearest Neighbor Search shown in algorithm 2.
To recover Defeatist Nearest Neighbor Search we simply omit the last return
statement.

Algorithm 2 Comprehensive Nearest Neighbor Search
1: procedure NNS($q$, $T_X$)
2:     if $T_X$ is a leaf then
3:         return $\arg\min_{x' \in T_X} d(q, x')$
4:     end if
5:     if $q \in$ Left($T_X$) and $q \notin$ Right($T_X$) then
6:         return NNS($q$, Left($T_X$))
7:     else if $q \notin$ Left($T_X$) and $q \in$ Right($T_X$) then
8:         return NNS($q$, Right($T_X$))
9:     else
10:        return $\arg\min_{x \in \{\text{NNS}(q, \text{Left}(T_X)),\, \text{NNS}(q, \text{Right}(T_X))\}} d(q, x)$
11:    end if
12: end procedure

kd-trees

The canonical example of a spatial partitioning tree is the k-d tree [Ben75]
(k-dimensional tree), which divides the dataset by the median of one of the coordinates,
essentially recursively splitting the size of the data structure by 2 at every level of
the tree (algorithm 3). With the data so cleanly separated, it’s easy to navigate the tree $T_X$
looking for the nearest neighbor of a query point $q$ (pseudocode in algorithm 4). Because $q$
can only be on one or the other side of the median of one of the dimensions, the
size of the search space is halved at every iteration. Once we reach a leaf node we
can naively compare the distance between $q$ and all points $p \in$ leaf and return the
closest such point. Constructing the k-d tree takes $T(n) = O(n) + 2T(n/2) =
O(n \log n)$. In the pseudocode below we will use $x^i$ to denote the $i$’th coordinate
value of $x$.

Queries for the nearest neighbor of a point $q \in Q$ can then be answered
in $O(\log n)$ using algorithm 4.

Algorithm 3 Constructing a k-d tree kd($X$)
1: Find the median $\text{med}^i(X)$ of some dimension $i$ (typically max-variance)
2: repeat
3:     For all $x \in X$
4:     if $x^i > \text{med}^i(X)$ then
5:         Add $x$ to Right($T_X$)
6:     else
7:         Add $x$ to Left($T_X$)
8:     end if
9: until All $x \in X$ have been considered
10: kd(Left($T_X$))
11: kd(Right($T_X$))

Algorithm 4 Nearest Neighbor Search using k-d tree NNS($q$, $T_X$)
1: Sort $\text{med}^i(X)$ for all $i$
2: Set $i = 0$
3: repeat
4:     if $T_X$ is a leaf then
5:         return $\arg\min_{x' \in T_X} d(q, x')$
6:     end if
7:     if $q^i > \text{med}^i(X)$ then
8:         $i = i + 1$
9:         $X$ = Right($T_X$)
10:        NNS($q$, Right($T_X$))
11:    else
12:        $i = i + 1$
13:        $X$ = Left($T_X$)
14:        NNS($q$, Left($T_X$))
15:    end if
16: until $i = D$ or $T_X$ is a leaf

Unfortunately the analysis above is flawed: in fact there is no prior guarantee
that our data can be so cleanly separated into halves! In the worst case we
can expect to recurse on all nodes in the tree, bumping up the query time for NNS
from $O(\log n)$ to $O(n)$, at which point we might as well just naively search for the
nearest neighbor. The problems stem from the inadequacy of k-d trees to give any
structure to high dimensional spaces. More generally, any spatial tree that uses
coordinate directions will be inadequate for nearest neighbor search. We will consider
an example proposed in [DS14]. $q$ is our query point and the dataset is
$X = \{x_1, \dots, x_n\}$. Take $x_1 = (1, \dots, 1)$ and for each of the other points
$x_i \in X - \{x_1\}$ pick a coordinate uniformly at random and set its value to $M$, where
$M$ is some large constant, and set its other coordinates to random values picked
uniformly from $[0, 1]$. We can see here that any coordinate split will create a large
separation between $q$ and $x_1$.
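The k-d tree construction and the defeatist descent discussed above can be sketched as follows. Several choices here are assumptions made for brevity and are not taken from the thesis: the split coordinate cycles with depth instead of using the max-variance coordinate, the median is found by sorting (which costs an extra logarithmic factor over linear-time selection), and the names `Node`, `build_kd` and `defeatist_search` are hypothetical.

```python
import math

class Node:
    def __init__(self, points=None, dim=None, median=None, left=None, right=None):
        self.points, self.dim, self.median = points, dim, median
        self.left, self.right = left, right

def build_kd(points, leaf_size=2, depth=0):
    """Split on the median of one coordinate until leaves hold <= leaf_size points."""
    if len(points) <= leaf_size:
        return Node(points=points)
    dim = depth % len(points[0])              # cycle through coordinates
    pts = sorted(points, key=lambda x: x[dim])
    mid = len(pts) // 2
    return Node(dim=dim, median=pts[mid][dim],
                left=build_kd(pts[:mid], leaf_size, depth + 1),
                right=build_kd(pts[mid:], leaf_size, depth + 1))

def defeatist_search(node, q):
    """Descend to a single leaf and scan it; may miss the true nearest neighbor."""
    while node.points is None:
        node = node.right if q[node.dim] >= node.median else node.left
    return min(node.points, key=lambda x: math.dist(x, q))

X = [(1, 1), (2, 5), (4, 2), (5, 6), (7, 3), (8, 8)]
tree = build_kd(X)
for q in [(7, 2.5), (4.9, 6)]:
    exact = min(X, key=lambda x: math.dist(x, q))
    print(q, "defeatist:", defeatist_search(tree, q), "exact:", exact)
```

The second query lies just left of the root split while its true nearest neighbor lies just right of it, so the defeatist descent returns a different point. This is the boundary effect that motivates spill trees.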

PCA Trees

The idea behind PCA trees [khT2] is similar to that of k-d trees but instead
of splitting according to the medians of the dimensions we split according to the
principal eigenvectors of the data’s covariance matrix. In fact we can simply use
the same construction scheme as k-d trees but change the split rule to the location
of our points relative to the principal eigenvector and of course apply it recursively
to the left and right children as shown in algorithm 5.

Algorithm 5 Nearest Neighbor Search using a PCA tree NNS($q$, $T_X$)
1: Sort in descending order the principal components $\lambda_1, \dots, \lambda_k$ with
   corresponding eigenvectors $u_1, \dots, u_k$ of covariance matrix $\Sigma$
2: Sort $\text{med}^i(X)$ for all $i$
3: Set $i = 1$
4: repeat
5:     if $q \cdot u_i > 0$ then
6:         $i = i + 1$
7:         NNS($q$, Right($T_X$))
8:     else
9:         $i = i + 1$
10:        NNS($q$, Left($T_X$))
11:    end if
12: until $i > k$


PCA trees unfortunately can be fooled even by relatively simple datasets.
Consider for instance an arrangement of points $x_1, \dots, x_n \in \mathbb{R}^2$ organized in two
lines parallel to the first coordinate axis. The distance between two successive
points on the first line is 2 and the distance between two successive points on the
second line is 1. The distance between the two lines is 4. Since there is a large
amount of data parallel to the first axis, the first axis will indicate the direction
of the first principal component. Once we project our data onto this principal
component we will interleave points from the two parallel lines onto the same line,
and if we set $q$ to be any point in the dataset we will always get an incorrect nearest
neighbor.

Random Projections Trees 

Random Projection Trees [DF08] are essentially k-d trees where the splitting
rule is done according to a random direction in the dataset (as shown in algorithm
6) instead of the max variance coordinate like in k-d trees. The construction of the
data structure is outlined below and nearest neighbor search is identical to search
in a k-d tree.

Algorithm 6 Nearest Neighbor Search using rp tree NNS($q$, $T_X$)
1: $i = 1$
2: Draw uniformly at random a direction $w$ from the $(D - 1)$-dimensional sphere
3: repeat
4:     if $q \cdot w > 0$ then
5:         $i = i + 1$
6:         NNS($q$, Right($T_X$))
7:     else
8:         $i = i + 1$
9:         NNS($q$, Left($T_X$))
10:    end if
11: until $i = k$


In the paper, the authors show how rp-trees can adapt to the intrinsic
dimension of a dataset, where the intrinsic dimension is defined using the local
covariance dimension: a measure of how well the covariance of the data is
captured using $d$ eigenvalues of the covariance matrix, where $d \ll D$ and $D$ is the
actual dimension of the dataset.
2-means Trees

This tree has a fairly simple splitting rule [BB95]: instead of splitting along
a random direction or the max variance direction, we divide up the dataset into
two clusters and set the split rule as the midpoint between the two cluster centers.
To obtain the two clusters $C_1$ and $C_2$, their respective centers $\mu_1$ and $\mu_2$ and the
points assigned to them $x \in C_j$, we run the k-means algorithm on our dataset
and set $k = 2$

$$\arg\min_{C_1, C_2} \sum_{j=1}^{2} \sum_{x \in C_j} \|x - \mu_j\|^2$$


Spill Trees 


A spill tree is not a spatial partitioning tree in and of itself because any of the
trees we’ve discussed so far can also be spill trees. A common problem among
spatial partitioning trees is that points near the decision boundary of splits can
be separated from their neighbors [DS14] [ML11]. However, by allowing spill, i.e.
overlap between the right and left subtree of a given node, we can limit the
problems associated with a hard partitioning (see algorithm 7). We can create a soft
partitioning by maintaining two decision boundaries $split + \tau$ and $split - \tau$. If
a given datapoint is to the left of $split - \tau$ then we assign it to the left subtree,
and if it is to the right of $split + \tau$ we assign it to the right subtree.
The interesting case is when a datapoint lies between the two decision boundaries:
when that happens we simply assign the point to both subtrees. It is worth noting
that allowing spill means we will have duplicate points across different leaves,
meaning we will slow down nearest neighbor queries. Spill trees serve no real purpose
if we’re doing a full search but can dramatically improve the results of defeatist
search at an extra time cost. So adjusting the size of spill essentially gives us an
easy way to set a tradeoff between the running time of nearest neighbor queries
and the quality of the found nearest neighbors.
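The soft assignment described above can be sketched as a single split step. The function name `spill_split` and the parameter names `w`, `split` and `tau` are illustrative; the direction `w` could come from any of the split rules discussed so far.

```python
def spill_split(points, w, split, tau):
    """Soft partition along direction w: points whose projection falls within tau
    of the split value are assigned to both children."""
    left, right = [], []
    for x in points:
        proj = sum(wi * xi for wi, xi in zip(w, x))
        if proj <= split + tau:
            left.append(x)
        if proj >= split - tau:
            right.append(x)
    return left, right

pts = [(0.0,), (0.95,), (1.2,), (2.0,)]
print(spill_split(pts, w=(1.0,), split=1.0, tau=0.0))   # hard partition
print(spill_split(pts, w=(1.0,), split=1.0, tau=0.1))   # (0.95,) spills into both sides
```

With the hard partition, a query at 1.01 descends to the right child and can no longer reach its true nearest neighbor 0.95. With tau = 0.1 that point is duplicated into the right child, at the cost of a larger leaf.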

We summarize the above techniques by first stating the general spatial
partitioning tree algorithm regardless of the splitting rule used. As a reminder, a
splitting rule is a function $f : X \to \{0, 1\}$ that assigns a $D$-dimensional point to
one of two subsets of nodes, Left or Right. We include table 1.1 that shows the
different partitioning rules as they were presented in [ML11].

Algorithm 7 Nearest Neighbor Search using spill rp tree NNS($q$, $T_X$, $\tau$)
1: $i = 1$
2: Draw uniformly at random a direction $w$ from the $(D - 1)$-dimensional sphere
3: repeat
4:     if $q \cdot w > \tau$ then
5:         $i = i + 1$
6:         NNS($q$, Right($T_X$))
7:     else if $q \cdot w < -\tau$ then
8:         $i = i + 1$
9:         NNS($q$, Left($T_X$))
10:    else
11:        $i = i + 1$
12:        return $\arg\min_x \{\text{NNS}(q, \text{Left}(T_X)),\, \text{NNS}(q, \text{Right}(T_X))\}$
13:    end if
14: until $i = k$

Table 1.1: Summary of split rules for different spatial trees

Tree           Split direction
kd-tree        $\arg\max_{e} \sum_{x} (e^T (x - \mu))^2$
pca-tree       $\arg\max_{v} v^T \Sigma v$ s.t. $\|v\|_2 = 1$
rp-tree        $S^{D-1}$ (drawn uniformly at random)
2means-tree    $\mu_1 - \mu_2$
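Three of the split rules summarized above are simple enough to sketch without a linear algebra library (the PCA rule needs an eigendecomposition and is omitted here). The function names are hypothetical, and drawing a uniform direction by normalizing a Gaussian vector is a standard technique assumed for illustration rather than taken from the thesis.

```python
import math
import random

def max_variance_coordinate(points):
    """k-d tree rule: index of the coordinate along which the data varies most."""
    n, D = len(points), len(points[0])
    def scatter(i):
        mu = sum(x[i] for x in points) / n
        return sum((x[i] - mu) ** 2 for x in points)
    return max(range(D), key=scatter)

def random_direction(D, rng=random):
    """rp-tree rule: a direction drawn uniformly from the unit sphere S^(D-1),
    obtained by normalizing a vector of independent standard Gaussians."""
    v = [rng.gauss(0.0, 1.0) for _ in range(D)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def two_means_direction(mu1, mu2):
    """2-means rule: the direction joining the two cluster centers."""
    return [a - b for a, b in zip(mu1, mu2)]

data = [(0, 0), (10, 1), (20, 0)]
print(max_variance_coordinate(data))
print(random_direction(3, random.Random(0)))
print(two_means_direction((1.0, 2.0), (0.0, 0.0)))
```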

1.3.2 Hashing Based Techniques 

Another completely different approach to solving nearest neighbor problems
is Locality Sensitive Hashing (LSH) [PI98]. Before we introduce the framework we
will introduce some basic terminology. Locality Sensitive Hashing is defined over
a family of hash functions.

Definition 1. We call a family $\mathcal{H}$ $(R, cR, P_1, P_2)$-sensitive if given two points
$p, q \in \mathbb{R}^D$

1. if $\|p - q\| \le R$ then $\Pr_{\mathcal{H}}[h(q) = h(p)] \ge P_1$

2. if $\|p - q\| \ge cR$ then $\Pr_{\mathcal{H}}[h(q) = h(p)] \le P_2$

For the family to be useful we need $P_1 > P_2$.
The actual algorithm (algorithm 8) then chooses $L$ hash functions, each a concatenation
of $k$ hash functions from the family $\mathcal{H}$, to achieve performance within a constant
factor of the theoretically optimal performance of hash based schemes. Analogously
to the tree based approaches, we first construct a data structure, in this
case the hash tables.

Algorithm 8 Locality Sensitive Hashing LSH($\mathcal{H}$, $L$, $R$, $c$)
1: Draw $h_{i,j}$ from a family of hash functions $\mathcal{H}$
2: Choose $L$ hash functions $g_1, \dots, g_L$ where $g_j = (h_{1,j}, \dots, h_{k,j})$
3: For every point $x_i \in X \subset \mathbb{R}^D$, hash it into $L$ different hash tables by evaluating
   $g_j(x_i)$
4: $j = 1$
5: while $j \le L$ do
6:     Retrieve points hashed into $g_j(q)$
7:     Compute the distance from $q$ to all retrieved points, if any of the points is a
       $cR$ nearest neighbor then return it and terminate
8:     $j = j + 1$
9: end while
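The structure of algorithm 8 can be sketched with a concrete hash family. The thesis does not fix one at this point, so the sketch assumes the projection-based family h(x) = floor((a . x + b) / w) with Gaussian a and uniform offset b, which is commonly used for Euclidean distance. The class name `EuclideanLSH` and the default values of `k`, `L` and `w` are arbitrary illustrative choices.

```python
import math
import random
from collections import defaultdict

class EuclideanLSH:
    def __init__(self, dim, k=4, L=8, w=4.0, seed=0):
        rng = random.Random(seed)
        # L composite hashes g_j, each a concatenation of k functions (a, b).
        self.funcs = [[([rng.gauss(0, 1) for _ in range(dim)], rng.uniform(0, w))
                       for _ in range(k)] for _ in range(L)]
        self.w = w
        self.tables = [defaultdict(list) for _ in range(L)]

    def _g(self, j, x):
        return tuple(math.floor((sum(ai * xi for ai, xi in zip(a, x)) + b) / self.w)
                     for a, b in self.funcs[j])

    def insert(self, x):
        for j, table in enumerate(self.tables):
            table[self._g(j, x)].append(x)

    def query(self, q, radius):
        """Return the first retrieved point within `radius` (playing the role of cR)
        of q, or None if no colliding point qualifies."""
        for j, table in enumerate(self.tables):
            for x in table.get(self._g(j, q), []):
                if math.dist(x, q) <= radius:
                    return x
        return None

lsh = EuclideanLSH(dim=2)
for x in [(0.0, 0.0), (1.0, 2.0), (0.5, 0.5), (9.0, 9.0)]:
    lsh.insert(x)
print(lsh.query((1.0, 2.0), radius=0.0))
print(lsh.query((1000.0, 1000.0), radius=1.0))
```

A query can only return points that collide with it in at least one table, so a near point may be missed; increasing `L` lowers that failure probability at the cost of more tables to build and probe.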


1.4 Experiments 

In this section we validate experimentally which spatial tree data structures
perform well on real data. We will use $n$ to denote the number of samples, $D$ to
denote the dimensionality of the dataset and $c$ to denote the number of possible
labelings.

1. Pima Indians diabetes dataset: $n = 768$, $D = 10$, $c = 2$

2. OptDigits dataset: $n = 5620$, $D = 64$, $c = 10$

On each dataset we run four different spatial trees (k-d trees, rp trees, PCA
trees and 2-means trees) implemented in [ML11], where we control two parameters:
the first is spill, which we tested for 3 values 0%, 0.05%, 0.1%, and the second is the
maximal number of comparisons we will make, i.e. the max allowable size of a leaf
node. We then plot the number of comparisons made vs the probability of finding
the nearest neighbor, which we calculate as the ratio of the sum of ranks of the brute
force algorithm over the sum of ranks reported by the spatial tree. As an example
suppose we’re looking for the two nearest neighbors of a query point $q$: brute force
search will return the correct ranks 1, 2 whose sum is 3. Our data structure might
return the $i$’th and $j$’th nearest neighbor instead. The ratio then becomes $3 / (i + j)$. We
report the results in figure 1.2 for the Pima dataset and figure 1.3 for the OptDigits
dataset. We also judge performance based on classification error using k-nn with
$k = 10$ vs number of comparisons in figure 1.4 for the Pima dataset and figure 1.5
for the OptDigits dataset.
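The rank-ratio score described above can be computed as follows. This is a sketch of the metric as stated in the text, not the evaluation code used for the figures; the function name `rank_ratio` is hypothetical and the brute-force ordering is found by sorting.

```python
import math

def rank_ratio(X, q, returned):
    """Sum of the ideal ranks 1..k divided by the sum of the true ranks
    of the k points the data structure returned."""
    order = sorted(X, key=lambda x: math.dist(x, q))
    ranks = [order.index(x) + 1 for x in returned]
    k = len(returned)
    return (k * (k + 1) / 2) / sum(ranks)

X = [(1,), (2,), (3,), (4,)]
q = (0,)
print(rank_ratio(X, q, [(1,), (2,)]))   # the true two nearest neighbors
print(rank_ratio(X, q, [(2,), (4,)]))   # 2nd and 4th nearest: 3 / (2 + 4)
```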
[Figure panel: rp tree on Pima dataset with spill = 0]

Figure 1.2: Probability of finding nearest neighbor vs number of comparisons for
different spatial trees and tree configurations for the Pima dataset

Figure 1.3: Probability of finding nearest neighbor vs number of comparisons for
different spatial trees and tree configurations for the OptDigits dataset

Figure 1.4: Error Rate on Classification vs number of comparisons for different
spatial trees and tree configurations for the Pima dataset

Figure 1.5: Error Rate on Classification vs number of comparisons for different
spatial trees and tree configurations for the OptDigits dataset

















Chapter 2 


Reductions 


This chapter will discuss reductions between a wide class of geometric
proximity problems. We will use the notation $P \le Q$ to say that problem $P$ reduces to $Q$.

2.1 Nearest Neighbor Reductions 

As a general outline, in this section we’ll be looking to solve the Euclidean 
Minimum Spanning Tree problem via a reduction to Bichromatic Nearest Neighbor 
Search which we prove to be equivalent to Nearest Neighbor Search. 


2.1.1 Bichromatic Nearest Neighbor 

We introduce the new problem of Bichromatic Nearest Neighbor Search (BNNS).
The setup is very similar to NNS but now every point $x_i$ is also associated with a
color $\chi(x_i) \in \{0, 1\}$ and we’d like to find the nearest neighbor of a query point $q$
such that the returned point is of a different color from $q$. More formally:

$$\arg\min_{x \in X \mid \chi(x) \ne \chi(q)} d(q, x)$$

NNS $\le$ BNNS

The reduction is shown in algorithm 9.

Algorithm 9 NNS algorithm via BNNS
1: Set $\chi(q) = 0$ and set $\chi(x) = 1$ for all $x \in X$ s.t. $x \ne q$
2: $T_X$ = BuildTree($X$)
3: return BNNS($q$, $T_X$)

BNNS $\le$ NNS

This simple reduction is described in algorithm 10.

Algorithm 10 BNNS algorithm via NNS
1: $T_X$ = BuildTree($\{x \in X : \chi(x) \ne \chi(q)\}$)
2: return NNS($q$, $T_X$)
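Both directions of the reduction can be sketched with brute-force oracles standing in for the tree-based routines. The function names are illustrative, and colored points are represented here as (point, color) pairs, which is an assumption of this sketch.

```python
import math

def nns(X, q):
    """Brute-force nearest neighbor oracle."""
    return min(X, key=lambda x: math.dist(x, q))

def bnns(colored_points, q, q_color):
    """Brute-force bichromatic oracle: closest point whose color differs from q's."""
    return min((p for p, c in colored_points if c != q_color),
               key=lambda p: math.dist(p, q))

def nns_via_bnns(X, q):
    """NNS <= BNNS: color the query 0 and every data point 1, then ask BNNS."""
    return bnns([(x, 1) for x in X], q, q_color=0)

def bnns_via_nns(colored_points, q, q_color):
    """BNNS <= NNS: discard the points sharing q's color, then ask NNS."""
    return nns([p for p, c in colored_points if c != q_color], q)

X = [(0, 0), (2, 2), (5, 1)]
colored = [((0, 0), 0), ((2, 2), 0), ((5, 1), 1)]
q = (1.8, 2.1)
print(nns_via_bnns(X, q), nns(X, q))
print(bnns_via_nns(colored, q, 0), bnns(colored, q, 0))
```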


2.1.2 Chromatic Nearest Neighbor 

A natural problem that follows from BNNS is Chromatic Nearest Neighbor Search
(CNNS) where we are trying to solve the same problem as BNNS but we could have
more than two colors or in fact a countably infinite number of colors $\chi : X \to \mathbb{N}$.
More practically though the number of colors we can have is bounded by the
number of points in our dataset.

$$\arg\min_{x \in X \mid \chi(x) \ne \chi(q)} d(q, x)$$

We will now show an equivalence between the two problems in algorithms 11 and 12.

CNNS $\le$ BNNS

Algorithm 11 CNNS algorithm via BNNS
1: $T_X = \{x \in X : \chi(x) \ne \chi(q)\}$
2: return BNNS($q$, $T_X$)

BNNS $\le$ CNNS

This reduction is trivial: we just run CNNS, as shown in algorithm 12. CNNS
guarantees that $\chi(x) \ne \chi(q)$ no matter how many possible colors we have.

Algorithm 12 BNNS algorithm via CNNS
1: $T_X$ = BuildTree($X$)
2: return CNNS($q$, $T_X$)


2.1.3 Euclidean Minimum Spanning Tree 


The minimum spanning tree problem is a classic graph problem where given
a connected weighted graph $G = (V, E)$ we’d like to find the edges $ST \subseteq E$ that
reach every $v \in V$ such that the sum of the edge weights $\sum_{e \in ST} w(e)$ is minimized.
The minimum spanning tree problem is defined over graphs with arbitrary distance
functions but we can restrict our attention to the Euclidean $\ell_2$ norm to recover the
Euclidean minimum spanning tree problem. As an example, figure 2.1 shows a Euclidean
MST built on random uniform data.

Figure 2.1: Euclidean MST on random uniform data
2.1.4 Boruvka’s algorithm for MST

In algorithm 13 we show how to use BNNS to solve the Euclidean Minimum
Spanning Tree problem via Boruvka’s algorithm. It is also worth noting that we
could have adapted the more standard algorithms by Prim or Kruskal like in [Ind00]
but the subroutine one would use would be Bichromatic Closest Pair (BCP) which can be
formulated as

$$\arg\min_{x, x' \in X \mid \chi(x) \ne \chi(x')} d(x, x')$$

In fact one can trivially solve BCP in $O(n)$ queries to BNNS.


Algorithm 13 Boruvka’s algorithm for MST
1: procedure Bor($V$, $E$)
2:     $T = (v_1, \dots, v_n)$ for all $v \in V$    ▷ Initialize T to be the set of one vertex trees
3:     while $|T| > 1$ do
4:         for each $C \in T$ do    ▷ C stands for components of T
5:             $S = \emptyset$    ▷ S is a set of edges
6:             for each vertex $v \in C$ do
7:                 $x' = \arg\min_{x \notin C} d(v, x)$
8:                 S.append($(v, x')$)
9:             end for
10:            $e' = \arg\min_{e \in S} w(e)$    ▷ w(e) is the weight of edge e
11:            T.append($e'$)
12:        end for
13:    end while
14:    return $T$    ▷ T is now the MST of (V, E)
15: end procedure



Theorem 1. Euclidean Minimum Spanning Tree can be solved with $O(n^2)$ queries
to Chromatic Nearest Neighbor Search and $O(n)$ queries to Chromatic Closest Pair
with at most $O(n \log n)$ updates to each data structure.

Proof. To implement the above algorithm we maintain a CNNS structure on
our set of points. Whenever we make a call to $x' = \arg\min_{x \notin C} d(v, x)$ we query
CNNS($v$, $T_X$), which guarantees a point from a different cluster, and we repeat
this $O(|V|)$ times. (Of course this last step could also be done with 1 call to a
Chromatic Closest Pair data structure.) Then whenever we merge two clusters
$C_i, C_j \in T$ we have to recolor the points $v \in \arg\min_{C \in \{C_i, C_j\}} |C|$, i.e. we only recolor
the points in the smaller of the two clusters. Every time a point is recolored the size
of its cluster at least doubles, so with that trick we recolor a point
at most $O(\log n)$ times instead of the trivial $O(n)$. Therefore we perform
at most $O(n \log n)$ recolorings in total. The correctness of the algorithm follows from
the correctness of Boruvka’s algorithm. □
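The algorithm and the recoloring trick from the proof can be sketched together. In this illustrative version the chromatic nearest neighbor query is a brute-force scan rather than a CNNS data structure, ties between equal-weight edges are broken by vertex index so that no cycle can form, and the name `euclidean_mst` is hypothetical.

```python
import math

def euclidean_mst(points):
    """Boruvka's algorithm with component ids as colors."""
    n = len(points)
    color = list(range(n))                   # color[i] = component of point i
    members = {i: [i] for i in range(n)}     # points of each component
    edges = []
    while len(members) > 1:
        cheapest = {}                        # component -> lightest outgoing edge
        for u in range(n):
            # Chromatic nearest neighbor of u: closest point of another color.
            v = min((x for x in range(n) if color[x] != color[u]),
                    key=lambda x: (math.dist(points[u], points[x]), x))
            cand = (math.dist(points[u], points[v]), min(u, v), max(u, v))
            if color[u] not in cheapest or cand < cheapest[color[u]]:
                cheapest[color[u]] = cand
        for w, s, t in set(cheapest.values()):
            a, b = color[s], color[t]
            if a == b:
                continue                     # defensive guard, not expected to fire
            if len(members[a]) < len(members[b]):
                a, b = b, a
            for x in members[b]:             # recolor only the smaller component
                color[x] = a
            members[a].extend(members.pop(b))
            edges.append((s, t, w))
    return edges

pts = [(0, 0), (1, 0), (3, 0), (3, 1), (10, 0)]
mst = euclidean_mst(pts)
print(mst, sum(w for _, _, w in mst))
```

Because the smaller component is always the one recolored, a point changes color at most a logarithmic number of times, which is the update bound used in the proof.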

2.2 Farthest Neighbor Reductions 

2.2.1 Farthest Neighbor Search 

So far we’ve discussed reductions to nearest neighbor but another natural
problem is that of Farthest Neighbor Search. Given $X = \{x_1, \dots, x_n\} \subset \mathbb{R}^D$ and a
query point $q \in \mathbb{R}^D$, find its farthest neighbor.

$$\arg\max_{x \in X} d(q, x)$$

This problem has several curious properties. The first is that it is equivalent
to finding the minimum enclosing ball centered at $q$; no direct analogue is readily
available for the nearest neighbor problem. The minimum enclosing ball problem
can be formulated as a convex optimization problem.

$$\min r \quad \text{s.t.} \quad d(q, x_i) \le r, \quad i = 1, \dots, n$$

Farthest Neighbor Search is used as a subroutine for an approximation
algorithm for the k-centers problem.
2.2.2 k-centers

We first introduce the k-centers problem: given $n$ i.i.d. data points from
a set $S \subset X$ where $X$ is a metric space, find $k$ representative centers $T$ of your
data set according to the cost function $\text{cost}(T) = \max_{x_i \in S} \min_{t \in T} d(x_i, t)$. k-center
is NP-hard even in 2-d spaces. The following beautiful algorithm (algorithm 14) due to
Gonzalez, called farthest first traversal [Das13], uses farthest neighbor search as a
subroutine to approximate the k-centers problem.


Algorithm 14 Farthest First Traversal
1: Input: $x_1, \dots, x_n \in \mathbb{R}^D$
2: pick an arbitrary $x \in S$
3: $T = \{x\}$
4: repeat
5:     $z = \arg\max_{x \in S} d(x, T)$
6:     $T = T \cup \{z\}$
7:     $x = z$
8: until $|T| = k$
9: return $T$    ▷ List of k-centers


The algorithm is incredibly simple: pick an arbitrary point and make it
a cluster center. Find the point farthest from the centers chosen so far and make it a
cluster center; repeat until you have $k$ centers and you have a 2-approximation of k-center.
Finding the farthest neighbor takes $O(n)$ trivially; repeating this $k$ times gives us a running
time of $O(kn)$. Now let’s prove why this guarantees a 2-approximation.

Theorem 2. Farthest First Traversal is a 2-approximation of the k-center problem.

Proof. Surprisingly the proof is again very simple. Let’s consider what the worst
case might look like if we constrain ourselves to only picking cluster centers from our
data points. Take two points $x, y$ on a 2D plane that are very far away from each
other. If we pick a cluster center from our data points then $\text{cost}(T) = d(x, y)$, but
had we been able to pick a cluster center outside of our data points then the optimal
algorithm would have simply set as the cluster center the midpoint between
those two points for a $\text{cost}(T^*) = d(x, y)/2$. Therefore, $\text{cost}(T) \le 2\,\text{cost}(T^*)$. □
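Farthest first traversal can be sketched in a few lines. The sketch keeps, for every point, its distance to the closest center chosen so far, so each round is a single O(n) pass and the total cost is O(kn) as stated above. The name `farthest_first` and the returned cost value are illustrative additions.

```python
import math

def farthest_first(points, k, start=0):
    """Pick k centers: start anywhere, then repeatedly add the point
    farthest from the centers chosen so far."""
    centers = [points[start]]
    # dist_to_T[i] = distance from points[i] to its closest chosen center
    dist_to_T = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        far = max(range(len(points)), key=dist_to_T.__getitem__)
        centers.append(points[far])
        dist_to_T = [min(d, math.dist(p, points[far]))
                     for d, p in zip(dist_to_T, points)]
    return centers, max(dist_to_T)           # centers and the resulting cost(T)

pts = [(0, 0), (1, 0), (10, 0), (11, 0), (20, 0)]
centers, cost = farthest_first(pts, k=3)
print(centers, cost)
```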








Chapter 3 


Rethinking Nearest Neighbor 

3.1 Dual Trees 

The algorithms we discussed above share the same limitation: suppose that
we’re answering a lot of queries, $|Q| = n$; then our pruning work is
redundant (we might re-traverse the tree in a very similar way to find a nearest
neighbor). The question is: can we exploit some sort of structure among the points
$q \in Q$? The answer is yes: we can build another tree for the query points. We
call such algorithms dual-tree algorithms and we now show how to use a query
tree $T_q$ and a reference tree $T_r$ to speed up Nearest Neighbor Search. As an example
we use cover trees [AB06]. We will not cover the construction of cover trees here
but we will mention the interesting invariant that the data structure guarantees:
given a node at level $j$ we can bound the distance to any of its successors by
$\lambda$. This suggests the following pruning algorithm using cover trees (algorithm 15).
We denote by $N_q$ the nodes in the query tree $T_q$; in the case of cover trees $N_q$ is
just a single point $q$. $p \in T_r$ is the best candidate nearest neighbor of $q$ so far
and $x$ is the set of all points beneath a given node in $T_r$.

This captures the intuition that if query points are close to one another
then they’re likely to have similar nearest neighbors. The pruning is conservative so
we are sure not to miss any candidates. In what follows we will show that a dual
tree can be adapted to any of the spatial partitioning tree algorithms by adding
one layer of abstraction in the form of a meta-algorithm.
Algorithm 15 Prune(N q , N r )

1: if d(p, q) — < d(p, x) + T for all q G N q then 

2: N r does not need to be explored 

3: else 

4: Prune(Child(N q ), N r ) 

5: Prune(N q ,Child(N r )) 

6: end if 


3.2 Meta Algorithms 

In the previous chapter we discussed reductions between geometric proxim¬ 
ity problems, the promise is that a good datastructure that performs on any one of 
these problems could be adapted to work on the others. However, most of the tree 
datastructures that we’ve discussed are surprisingly similar and were proposed to 
be unified under a single meta-tree jC'MR + 13| . They share: 

• A search tree (e.g. k-d tree, PCA tree, ...) 

• BaseCase(), which determines what is to be done with a combination of points 
(e.g. $\arg\min_{x \in X} d(x, q)$) 

• Score(), which determines whether a certain subtree should be pruned or not 
(e.g. $x_i > \mathrm{median}_i$, ...) 

Here each node $N$ in the tree contains a convex subset of S. As an example, 
a cover tree falls under this framework because sets consisting of single points 
are convex; k-d trees also fall under this framework since they partition space 
into boxes, which are also convex. 
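
The three shared components can be sketched in Python as follows. This is only an illustrative single-tree traversal under our own assumptions, not the interface of [CMR+13]: the `Node` class, the bounding-ball summary and all function names are hypothetical.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    points: list            # all points beneath this node
    center: tuple           # centroid of those points
    radius: float           # max distance from the centroid to a point
    children: list = field(default_factory=list)

def make_node(points, children=()):
    dim = len(points[0])
    center = tuple(sum(p[i] for p in points) / len(points) for i in range(dim))
    radius = max(math.dist(center, p) for p in points)
    return Node(list(points), center, radius, list(children))

def nearest(root, q):
    best = [math.inf, None]               # [distance, point]

    def base_case(p):                     # BaseCase: compare one point with q
        d = math.dist(q, p)
        if d < best[0]:
            best[0], best[1] = d, p

    def score(node):                      # Score: inf means "prune this subtree"
        lower = max(0.0, math.dist(q, node.center) - node.radius)
        return math.inf if lower >= best[0] else lower

    def traverse(node):
        if score(node) == math.inf:
            return
        if not node.children:
            for p in node.points:
                base_case(p)
            return
        for child in sorted(node.children, key=score):
            traverse(child)

    traverse(root)
    return best[1], best[0]
```

Swapping in a different tree only changes how `Node` objects are built; `base_case` and `score` stay the same, which is the point of the abstraction.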

The idea here is that if we look at the nearest neighbor literature we notice that 
many different tree-based data structures have been proposed to solve the 
problem. A typical paper highlights the shortcomings of some past data structures 
(more often than not the inadequacy of k-d trees in high dimension) and then 
proposes its own data structure, whose validity is verified with an experimental 
analysis and a theoretical one. The problem is that there seems to be 
a lot of redundant work, and it would be nice to have one meta-algorithm that can 
reproduce the data structures we covered in chapter 1, then analyze this 
meta-algorithm and have our results automatically carry over to the others. The two 
bounds below draw balls $B$ around a node $N_q$. Here $D_q[k]$ is the distance 
between $q$ and its $k$'th nearest neighbor (so far), and $\mathcal{D}^p_{N_q}$ is the 
set of points that are descendants of the node $N_q$; the superscript $p$ specifies 
that $\mathcal{D}^p_{N_q}$ is a set of points and not a set of nodes. $\lambda(N_q)$ is 
the radius of the convex hull of $N_q$, with $2\lambda(N_q)$ being the diameter, i.e. 
the maximum interpoint distance. 

$$B_1(N_q) = \max_{q \in \mathcal{D}^p_{N_q}} D_q[k]$$ 

$$B_2(N_q) = \min_{q \in \mathcal{D}^p_{N_q}} D_q[k] + 2\lambda(N_q)$$ 

It is easy to see why the spatial trees we have seen so far satisfy these 
bounds; in particular, the cover tree example at the beginning of this chapter 
uses the second. 
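
To make the two bounds concrete, the following Python sketch evaluates them for a small query node. It makes two simplifying assumptions that are ours: the $k$'th nearest neighbor distances are the true ones rather than the best found so far, and the radius is measured from the centroid of the node, so that twice the radius upper-bounds the diameter.

```python
import math

def kth_nn_dist(q, reference, k):
    """Distance from q to its k'th nearest neighbor in reference."""
    return sorted(math.dist(q, x) for x in reference)[k - 1]

def node_bounds(query_points, reference, k=1):
    d_k = [kth_nn_dist(q, reference, k) for q in query_points]
    dim = len(query_points[0])
    center = tuple(sum(q[i] for q in query_points) / len(query_points)
                   for i in range(dim))
    lam = max(math.dist(center, q) for q in query_points)  # centroid radius
    b1 = max(d_k)                 # worst k'th-neighbor distance in the node
    b2 = min(d_k) + 2 * lam       # best one plus the node diameter bound
    return b1, b2
```

Both values upper-bound the $k$'th nearest neighbor distance of every query point in the node, so a traversal may prune with whichever is smaller.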

3.3 Random rp-Tree Forests 

Our experiments show that rp-trees are robust in helping us find the nearest 
neighbor of a query point $q$ in high dimension. Instead of splitting according 
to a single random direction chosen from the unit sphere, we can build several 
rp-trees, say $k$ of them, which will be different with high probability 
(Algorithm 16). We then perform NNS($q$, $T_i$) for $i = 1, \ldots, k$; each call 
returns a leaf node $N_i$, and from these we calculate 
$\arg\min_{x \in \bigcup_{i=1}^{k} N_i} d(q, x)$. 

Algorithm 16 Random rp-Tree Forest 

$i = 1$ 
while $i \le k$ do    ▷ k is the number of trees we will build 
    $T_i$ = BuildRp($X$) 
    $N_i$ = NNS($q$, $T_i$) 
    $i = i + 1$ 
end while 
$N = \bigcup_{i=1}^{k} N_i$ 
return $\arg\min_{x \in N} d(q, x)$ 
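
A small Python sketch of the forest follows. It is a simplification under our own assumptions: each tree splits at the exact median of a Gaussian random projection (no random perturbation of the split point), the search is defeatist (one leaf per tree), and `build_rp`, `nns_leaf` and `forest_nn` are hypothetical names.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def build_rp(points, rng, leaf_size=4):
    """Recursively split points at the median of a random projection."""
    if len(points) <= leaf_size:
        return {"leaf": points}
    direction = [rng.gauss(0.0, 1.0) for _ in range(len(points[0]))]
    ordered = sorted(points, key=lambda p: dot(direction, p))
    mid = len(ordered) // 2
    return {"direction": direction,
            "threshold": dot(direction, ordered[mid]),
            "left": build_rp(ordered[:mid], rng, leaf_size),
            "right": build_rp(ordered[mid:], rng, leaf_size)}

def nns_leaf(tree, q):
    """Route q down to a single leaf and return the points stored there."""
    while "leaf" not in tree:
        go_left = dot(tree["direction"], q) < tree["threshold"]
        tree = tree["left"] if go_left else tree["right"]
    return tree["leaf"]

def forest_nn(points, q, k, seed=0):
    """Build k independent trees and search the union of the k leaves."""
    rng = random.Random(seed)
    candidates = []
    for _ in range(k):
        candidates.extend(nns_leaf(build_rp(points, rng), q))
    return min(candidates, key=lambda x: math.dist(q, x))
```

Each tree can miss the true nearest neighbor on its own; taking the best candidate over the union of leaves can only improve on any single tree.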


Of course we can also consider cases where every rp-tree is a spill tree 
with a different spill percentage, so instead of calling $T_i$ = BuildRp($X$) we 
can call $T_i$ = BuildRp($X$, $\alpha_i$), where $\alpha_i$ is the spill for $T_i$. The 
above highlights the power randomness affords us in reasoning about the proximity 
of points in Euclidean space. We summarize these effects below. 

1. Random Projection as a dimensionality reduction preprocessing step via the 
Johnson-Lindenstrauss Lemma 

2. Random Partition of space via Rp Trees 

3. Randomness over size of spill 

4. Randomness over datastructure via Random Forests 

The first point can be understood from [SD02], the second and third from [DS14]. 
The fourth introduces some complexity. A single rp-tree has only two 
hyperparameters to tune, the maximum allowed depth of the tree and the maximum 
size of a node, and since these are inversely related, in practice we only need 
to tune one of the two. However, with $k$ trees we now have $O(k)$ more parameters 
to tune, which introduces difficulties at both the theoretical and the practical 
level: it is not clear a priori which arrangement of hyperparameters might yield 
better results. For lack of a clear understanding of the fourth point we leave it 
here as an open problem. 

3.4 Bichromatic Closest Pair 


We now introduce the bichromatic closest pair (BCP) problem: 

$$\arg\min_{x, x' \in X \,:\, \chi(x) \neq \chi(x')} d(x, x')$$ 

This problem is central to the EMST reductions in [Ind00] via Kruskal's and Prim's 
algorithms. We turn our attention to it in this section because it seems like a 
natural problem for dual trees: we can think of the color of a point, 
$\chi(x) \in \{0, 1\}$, as assigning the point to one of two sets, the first a query 
set $Q$ and the second a reference set $X$. 


Below we propose Algorithm 17 for full BCP. The algorithm takes as input a 
list of points $x_1, \ldots, x_n \in \mathbb{R}^D$, where every point $x_i$ is assigned 
one of two colors $\chi(x_i) \in \{0, 1\}$, and returns a dictionary sorted by value, 
where the keys are pairs of points and the values are the distances between those 
two points. 


Algorithm 17 Dual Tree BCP 

Set $Q = \{x \mid \chi(x) = 1\}$ 
Set $R = \{x \mid \chi(x) = 0\}$ 
$L$ = DualTree($Q$, $R$) 
return $\arg\min_{(x, x') \in L} d(x, x')$ 
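
For checking a dual-tree implementation on small inputs, a brute-force baseline is useful. The Python sketch below is that baseline only, not the dual-tree algorithm: it compares every pair of differently colored points, and the function name is ours.

```python
import math
from itertools import product

def bcp_brute_force(points, colors):
    """Return all bichromatic pairs sorted by distance, closest first.

    points: list of coordinate tuples; colors: list of 0/1 labels.
    """
    Q = [p for p, c in zip(points, colors) if c == 1]   # query set
    R = [p for p, c in zip(points, colors) if c == 0]   # reference set
    pairs = {(q, r): math.dist(q, r) for q, r in product(Q, R)}
    return sorted(pairs.items(), key=lambda item: item[1])
```

The first entry of the returned list is the bichromatic closest pair together with its distance; the cost is $O(|Q||R|)$ distance evaluations.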


3.5 Batch Nearest Neighbor 

We are again in the setting where we have a set of query points $Q$ and 
a reference dataset $X$, and for every element $q \in Q$ we would like to find its 
nearest neighbor $x \in X$. One of two things could happen. Suppose first that the 
query points are very well clumped together, i.e. $d(q_i, q_j) < \epsilon$ for any 
$q_i, q_j \in Q$; then we obtain an equivalence between batch nearest neighbor and 
nearest neighbor. The intuition here is that if query points are close to one 
another then information about the nearest neighbor of one query point implies 
information about the other points' nearest neighbors as well. To see why this is 
the case, suppose $d(q_i, q_j) < \epsilon$ and $x^* = \arg\min_{x \in X} d(q_i, x)$. The 
triangle inequality tells us that $d(q_i, q_j) + d(q_i, x^*) \ge d(q_j, x^*)$. If 
$d(q_i, q_j) < \epsilon \approx 0$ then $d(q_i, x^*) \gtrsim d(q_j, x^*)$, so $x^*$ is 
almost as close to $q_j$ as it is to $q_i$. 
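
The clumped case can be made quantitative with two applications of the triangle inequality. In the derivation below, $x_i^*$ and $x_j^*$ are our notation for the exact nearest neighbors of $q_i$ and $q_j$ in $X$, and we assume $d(q_i, q_j) < \epsilon$.

```latex
\begin{align*}
d(q_j, x_i^*) &\le d(q_j, q_i) + d(q_i, x_i^*)
    && \text{(triangle inequality)} \\
  &< \epsilon + d(q_i, x_j^*)
    && \text{($x_i^*$ is the nearest neighbor of $q_i$)} \\
  &\le \epsilon + d(q_i, q_j) + d(q_j, x_j^*)
    && \text{(triangle inequality)} \\
  &< 2\epsilon + d(q_j, x_j^*).
\end{align*}
```

So reusing the nearest neighbor of $q_i$ as the answer for $q_j$ costs at most an additive $2\epsilon$ in distance.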


We can also imagine datasets where the query points are far apart. In this 
case query points are unlikely to share nearest neighbors, but given that we know 
the nearest neighbor(s) of a given point and we know that $d(q_i, q_j)$ is large, 
we can prune out the nearest neighbor of $q_i$ when looking for the nearest 
neighbor of $q_j$. A good approach to pruning here is to draw a ball around every 
point $q \in Q$ and automatically reject any candidate nearest neighbor that lies 
outside that ball. 

The two points highlighted above allow us to construct an algorithm for 
batch nearest neighbor search. The algorithm needs two constants. The first, 
$\alpha$, sets a threshold: we consider two query points $q, q'$ to be nearest 
neighbor equivalent if $d(q, q') < \alpha$. The second, $\beta$, splits one node of 
the query tree into two subnodes, the first comprising all $q, q'$ s.t. 
$d(q, q') < \beta$ and the second all $q, q'$ s.t. $d(q, q') > \beta$. For $\beta$ we can 
take points that are at least the radius of the convex body away to be very far 
away, so we set $\beta = \mathrm{radius}(Q)$. 

First we construct the query tree $T_q$ to exploit the two intuitions above, 
as described in Algorithm 18. For example, if $\beta = \mathrm{median}_1(Q)$ we 
recover the k-d tree. 


Algorithm 18 Building a query tree BuildTq($N_q$) 

Set root($T_q$) = $Q$ 
while $|T_q| >$ constant do 
    if $d(q, q') < \alpha$ then 
        Merge($q$, $q'$) 
    end if 
    Set LeftTree = BuildTq($\{q \in Q \mid d(q, q') < \beta\}$) 
    Set RightTree = BuildTq($\{q \in Q \mid d(q, q') > \beta\}$) 
end while 


Now, given a query tree $T_q$, we build a reference tree $T_r$ on the dataset 
$X$ using any reasonable splitting rule and perform nearest neighbor queries on 
subsets $N_q \subset Q$ and $N_r \subset X$ (Algorithm 19). We will have two functions 
that act on a subset of query points $N_q$: 

1. Merge($N_q$), which merges all $q \in N_q$ if Clumped($N_q$) = True 

2. Split($N_q$), which splits $N_q$ into a right and left subtree if Spread($N_q$) = True 


Algorithm 19 DualNNS($N_q$, $N_r$) 

if $|N_q| = 1$ then 
    return NNS($N_q$, $X$) 
end if 
if Clumped($N_q$) = True then 
    Merge($N_q$) 
end if 
if Spread($N_q$) = True then 
    Left, Right = Split($N_q$) 
    Right, Left = DualNNS(Right, $N_r$), DualNNS(Left, $N_r$) 
end if 
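
The merge and split logic can be sketched in Python as below. This is one possible reading under assumptions of our own, not a definitive implementation: Clumped is taken to mean "diameter below `alpha`", Spread is everything else, the split is a median split along the widest coordinate, the reference side is searched by brute force instead of a tree $N_r$, and query points are assumed distinct.

```python
import math

def nn(q, X):
    """Exact nearest neighbor of q in X by linear scan."""
    return min(X, key=lambda x: math.dist(q, x))

def diameter(Q):
    return max((math.dist(a, b) for a in Q for b in Q), default=0.0)

def dual_nns(Q, X, alpha):
    """Map every query in Q to a neighbor in X.

    Clumped groups share a single search, so their answers are only
    approximate (within 2 * alpha of optimal by the triangle inequality).
    """
    if len(Q) == 1:
        return {Q[0]: nn(Q[0], X)}
    if diameter(Q) < alpha:                 # Clumped: merge into one query
        shared = nn(Q[0], X)
        return {q: shared for q in Q}
    # Spread: split at the median of the widest coordinate and recurse.
    dim = max(range(len(Q[0])),
              key=lambda i: max(q[i] for q in Q) - min(q[i] for q in Q))
    ordered = sorted(Q, key=lambda q: q[dim])
    mid = len(ordered) // 2
    result = dual_nns(ordered[:mid], X, alpha)
    result.update(dual_nns(ordered[mid:], X, alpha))
    return result
```

Setting `alpha = 0` disables merging and recovers exact answers for every query; larger values trade accuracy for fewer searches.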


Now let us take a closer look at Algorithm 19 by introducing a notion 
of difficulty for batch nearest neighbor search. 


3.6 Measuring the difficulty of Dual Tree NNS 


In a recent paper [DS14] the authors propose a potential function to measure 
the difficulty of exact NNS. The setting is the usual one: we have a query point 
$q \in \mathbb{R}^D$ and a dataset $x_1, \ldots, x_n \in X \subset \mathbb{R}^D$, and we 
would like to find the nearest neighbor to $q$ in $X$. 

$$\Phi(q, \{x_1, \ldots, x_n\}) = \frac{1}{n} \sum_{i=2}^{n} \frac{\|q - x_{(1)}\|}{\|q - x_{(i)}\|}$$ 

Here $x_{(k)}$ denotes the $k$'th nearest neighbor of $q$. Upon further inspection of 
this function we can see that when it is close to 1 all the points are more or 
less the same distance from $q$, and we can expect nearest neighbor queries to 
be difficult. On the other hand, when $\Phi$ is close to 0 most of the points are 
far away from the nearest neighbor, and intuitively we would expect nearest 
neighbor search to become easy. The authors in fact determine that the failure 
probability of an rp-tree is on the order of $\Phi \log(1/\Phi)$ and that of a spill 
tree is on the order of $\Phi$. Generally, though, we will be considering nodes 
$N \subset X$ in our spatial tree, so we add a simple modification: 

$$\Phi_m(q, \{x_1, \ldots, x_n\}) = \frac{1}{m} \sum_{i=2}^{m} \frac{\|q - x_{(1)}\|}{\|q - x_{(i)}\|}$$ 


Their results easily extend to the $k$ nearest neighbor case; a simple 
modification is made to the potential function $\Phi$, which now becomes: 

$$\Phi_{k,m}(q, \{x_1, \ldots, x_n\}) = \frac{1}{m} \sum_{i=k+1}^{m} \frac{\left(\|q - x_{(1)}\| + \cdots + \|q - x_{(k)}\|\right)/k}{\|q - x_{(i)}\|}$$ 
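
The potential is cheap to evaluate directly, which helps build intuition for when a query is hard. The Python sketch below assumes the form $\Phi_{k,m} = \frac{1}{m}\sum_{i=k+1}^{m} \frac{(\|q - x_{(1)}\| + \cdots + \|q - x_{(k)}\|)/k}{\|q - x_{(i)}\|}$ and that $q$ does not coincide with more than $k$ data points; the function name is ours.

```python
import math

def potential(q, points, k=1, m=None):
    """Potential of query q against points.

    k=1 and m=len(points) gives the basic potential; smaller m restricts
    the sum to the m closest points, as for a node of a spatial tree.
    """
    dists = sorted(math.dist(q, x) for x in points)
    m = len(dists) if m is None else m
    avg_k = sum(dists[:k]) / k          # mean distance to the k nearest
    # zero-based indices k..m-1 are the (k+1)'th to m'th nearest neighbors
    return sum(avg_k / dists[i] for i in range(k, m)) / m
```

For example, with neighbors at distances 1, 2 and 4 the value is $(1/2 + 1/4)/3 = 0.25$, while four equidistant neighbors give $3/4$, which tends to 1 as more equidistant points are added.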


The $k$ nearest neighbor expression is unfortunately cluttered, so we will drop 
it and w.l.o.g. set $k = 1$, bearing in mind that the extension does not present 
any difficulties. For our contribution we first propose an extension to batch 
nearest neighbor by again simply modifying the potential function $\Phi$. We define 
$x^{q_j}_{(i)}$ as the $i$'th nearest neighbor to query point $q_j$, so that 
$x^{q_j}_{(1)}$ denotes the nearest neighbor to $q_j$. The potential function for a 
specific query point is then: 

$$\Phi_m(q_j) = \frac{1}{m} \sum_{i=k+1}^{m} \frac{\|q_j - x^{q_j}_{(1)}\|}{\|q_j - x^{q_j}_{(i)}\|}$$ 

The potential for $Q$ is just the sum over all the points $q_j \in Q$ (or more 
generally all query points $q \in N \subset Q$): 

$$\Phi_1(Q) = \sum_{q_j \in Q} \Phi_m(q_j)$$ 

The above expression takes into account how easy it is to find the nearest 
neighbors of a given set of query points, but it does not exploit any structure 
in the query set $Q$. However, in the dual tree setting we know that the closer 
the query points are, the easier batch nearest neighbor queries become, and we 
can represent this intuition as another potential function on the query set $Q$. 
Here we have two possible candidates. We can look at the average interpoint 
distance (written here as the sum over all pairs, which equals the average up to 
a factor of $\binom{n}{2}$), which trivially takes $O(n^2)$ time to compute: 

$$\Phi_2(Q) = \sum_{i=1}^{n} \sum_{j=i+1}^{n} \|q_i - q_j\|$$ 

Alternatively we can use the diameter of $Q$, which again takes $O(n^2)$ time to 
compute: 

$$\Phi_2(Q) = \max_{q_i, q_j \in Q} \|q_i - q_j\|$$ 

We now have all the components we need to write a potential function 
for exact batch nearest neighbor search: 

$$\Phi(Q, X) = \Phi_1(Q)\,\Phi_2(Q)$$ 

The new bound we then propose on batch nearest neighbor search via rp-trees 
is $\Phi_1 \Phi_2 \log \frac{1}{\Phi_1}$. $\Phi_1$ has a linear dependence on $|Q|$, which 
is alarming because the previous bounds had no dependence on the number of points 
in our dataset. It is nevertheless intuitive: if $\Phi_2$ is large, indicating that 
$Q$ is not well clumped together, then we expect the complexity of our problem to 
increase linearly with the number of query points, and should there be structure 
in $Q$ our bound will reap its rewards. An algorithm for batch nearest neighbor 
search would then be one that at every iteration finds 

$$\arg\min_{q} \Phi_1 \Phi_2 \log \frac{1}{\Phi_1}$$ 

and then removes the found query point $q$ from the dataset and performs 
a linear search instead to find its nearest neighbor(s). 
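
The batch potential of this section can be evaluated with a few lines of Python. The sketch below assumes $k = 1$, $m = |X|$, and the diameter variant of $\Phi_2$; all function names are ours, and the quadratic-time loops are meant for small examples only.

```python
import math

def phi(q, X):
    """Single-query potential with k = 1 and m = len(X)."""
    dists = sorted(math.dist(q, x) for x in X)
    return sum(dists[0] / d for d in dists[1:]) / len(dists)

def phi1(Q, X):
    """Sum of the single-query potentials over the query set."""
    return sum(phi(q, X) for q in Q)

def phi2_diameter(Q):
    """Diameter of the query set (maximum interpoint distance)."""
    return max(math.dist(a, b) for a in Q for b in Q)

def batch_potential(Q, X):
    """Product of the two potentials, Phi(Q, X) = Phi_1(Q) * Phi_2(Q)."""
    return phi1(Q, X) * phi2_diameter(Q)
```

A tightly clumped query set has a small diameter and therefore a small batch potential, matching the intuition that such batches are easier.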



Appendix A 


Mathematical Background 


Definition 2. A set $S \subset \mathbb{R}^D$ is convex if for any $u_1, \ldots, u_k \in S$ 
and $w_1, \ldots, w_k \ge 0$ s.t. $w_1 + \cdots + w_k = 1$ 

$$\sum_{i=1}^{k} w_i u_i \in S$$ 

Definition 3. The Bregman divergence of a differentiable function $f$ is 

$$d_f(x, y) = f(x) - f(y) - \langle \nabla f(y), x - y \rangle$$ 
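
As a quick sanity check, the Python sketch below evaluates the standard form $d_f(x, y) = f(x) - f(y) - \langle \nabla f(y), x - y \rangle$ with the gradient supplied by the caller. The helper names are ours; choosing $f(x) = \|x\|^2$ recovers the squared Euclidean distance.

```python
def bregman(f, grad_f, x, y):
    """Bregman divergence d_f(x, y) for a differentiable function f."""
    inner = sum(g * (a - b) for g, a, b in zip(grad_f(y), x, y))
    return f(x) - f(y) - inner

def sq_norm(v):
    """f(v) = ||v||^2."""
    return sum(a * a for a in v)

def sq_norm_grad(v):
    """Gradient of ||v||^2, namely 2v."""
    return [2 * a for a in v]
```

For `x = (3, 4)` and `y = (0, 1)` this gives $25 - 1 - 6 = 18$, which equals $\|x - y\|^2 = 9 + 9$.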

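Lemma 1 (Johnson-Lindenstrauss). Given $0 < \epsilon < 1$ and a dataset 
$x_1, \ldots, x_n \in X \subset \mathbb{R}^D$, there exists a linear map 
$f : \mathbb{R}^D \to \mathbb{R}^d$ with $d = O(\epsilon^{-2} \log n)$ s.t. for all 
$x_i, x_j \in X$ 

$$(1 - \epsilon)\|x_i - x_j\|^2 \le \|f(x_i) - f(x_j)\|^2 \le (1 + \epsilon)\|x_i - x_j\|^2$$ 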

In a nutshell, the lemma says that we can randomly project data points onto a 
lower dimensional subspace without distorting interpoint distances too much; a 
proof of the lemma can be found in [SD02]. The lemma finds its way into many 
machine learning algorithms as a preprocessing step to limit the curse of 
dimensionality, and spatial trees are no exception. 
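
A random projection of this kind takes only a few lines of Python. The sketch below is one common construction, assumed here for illustration: a $d \times D$ matrix of independent Gaussian entries scaled by $1/\sqrt{d}$. The function name and the seed parameter are ours.

```python
import math
import random

def random_projection(points, d, seed=0):
    """Project D-dimensional points to d dimensions with a Gaussian matrix.

    Entries are N(0, 1) / sqrt(d), so squared distances are preserved
    in expectation.
    """
    rng = random.Random(seed)
    D = len(points[0])
    scale = 1.0 / math.sqrt(d)
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(D)]
              for _ in range(d)]
    return [tuple(sum(r * c for r, c in zip(row, p)) for row in matrix)
            for p in points]
```

Comparing `math.dist` between pairs before and after projection gives an empirical view of the distortion for a chosen target dimension `d`.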